44 research outputs found
Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification
Sigmoid output layers are widely used in multi-label classification (MLC)
tasks, in which multiple labels can be assigned to any input. In many practical
MLC tasks, the number of possible labels is in the thousands, often exceeding
the number of input features and resulting in a low-rank output layer. In
multi-class classification, it is known that such a low-rank output layer is a
bottleneck that can result in unargmaxable classes: classes which cannot be
predicted for any input. In this paper, we show that for MLC tasks, the
analogous sigmoid bottleneck results in exponentially many unargmaxable label
combinations. We explain how to detect these unargmaxable outputs and
demonstrate their presence in three widely used MLC datasets. We then show that
they can be prevented in practice by introducing a Discrete Fourier Transform
(DFT) output layer, which guarantees that all sparse label combinations with up
to k active labels are argmaxable. Our DFT layer trains faster and is more
parameter efficient, matching the F1@k score of a sigmoid layer while using up
to 50% fewer trainable parameters. Our code is publicly available at
https://github.com/andreasgrv/sigmoid-bottleneck
Comment: Published at AAAI2
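The bottleneck described in this abstract can be illustrated with a toy layer. The sketch below is not the paper's detection method or its DFT layer; it assumes a hypothetical 4-label sigmoid layer over 2 input features with no bias term, and simply enumerates which label combinations any input can produce.

```python
import numpy as np

# Hypothetical 4-label sigmoid layer over 2 features, no bias term (assumption).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])

# Without a bias, the predicted label set depends only on the direction of x,
# so sweeping the unit circle covers every input. The small offset keeps the
# samples off the decision boundaries.
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False) + 1e-3
inputs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# sigmoid(z) > 0.5 is equivalent to z > 0.
predictions = (inputs @ W.T) > 0
reachable = {tuple(int(v) for v in row) for row in predictions}

n_labels = W.shape[0]
print(f"{len(reachable)} of {2 ** n_labels} label combinations are reachable")
```

For this particular weight matrix, four lines through the origin cut the plane into eight sectors, so only 8 of the 16 label combinations can be predicted. A combination such as labels 1 and 2 active with label 3 inactive is unreachable, because it would need x1 > 0, x2 > 0 and x1 + x2 < 0 at once.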
Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice
Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The Softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is that some words may be impossible to predict via argmax, irrespective of input features, and empirically, there is evidence that this happens in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our algorithms and code to the public.
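A small numerical sketch may help make "unargmaxable" concrete. It is not the detection algorithm released with the paper; it assumes a hypothetical bias-free softmax layer with four classes over 2 features, where the last class weight vector is the centroid of the other three, and checks empirically that this class is never the argmax.

```python
import numpy as np

# Hypothetical 2-feature softmax layer with four classes and no bias (assumption).
# The last class is the centroid of the first three weight vectors.
corners = np.array([[2.0, 0.0],
                    [-1.0, 2.0],
                    [-1.0, -2.0]])
W = np.vstack([corners, corners.mean(axis=0)])

rng = np.random.default_rng(0)
inputs = rng.normal(size=(10_000, 2))

# Softmax is monotonic, so the argmax of the logits is the predicted class.
predicted = np.argmax(inputs @ W.T, axis=1)
counts = np.bincount(predicted, minlength=W.shape[0])
print(counts)  # the last entry counts predictions of the centroid class
```

The logit of the centroid class is the mean of the other three logits, so it can never exceed their maximum, and the last count is zero whatever inputs are sampled.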
What do character-level models learn about morphology? The case of dependency parsing
When parsing morphologically-rich languages with neural models, it is
beneficial to model input at the character level, and it has been claimed that
this is because character-level models learn morphology. We test these claims
by comparing character-level models to an oracle with access to explicit
morphological analysis on twelve languages with varying morphological
typologies. Our results highlight many strengths of character-level models, but
also show that they are poor at disambiguating some words, particularly in the
face of case syncretism. We then demonstrate that explicitly modeling
morphological case improves our best model, showing that character-level models
can benefit from targeted forms of explicit morphological modeling.
Comment: EMNLP 201
A systematic review of natural language processing applied to radiology reports
NLP has a significant role in advancing healthcare and has been found to be
key in extracting structured information from radiology reports. Understanding
recent developments in NLP application to radiology is of significance but
recent reviews on this are limited. This study systematically assesses recent
literature in NLP applied to radiology reports. Our automated literature search
yields 4,799 results using automated filtering, metadata enriching steps and
citation search combined with manual review. Our analysis is based on 21
variables including radiology characteristics, NLP methodology, performance,
study, and clinical application characteristics. We present a comprehensive
analysis of the 164 publications retrieved with each categorised into one of 6
clinical application categories. Deep learning use increases but conventional
machine learning approaches are still prevalent. Deep learning remains
challenged when data is scarce and there is little evidence of adoption into
clinical practice. Despite 17% of studies reporting greater than 0.85 F1
scores, it is hard to comparatively evaluate these approaches given that most
of them use different datasets. Only 14 studies made their data and 15 their
code available with 10 externally validating results. Automated understanding
of clinical narratives of the radiology reports has the potential to enhance
the healthcare process but reproducibility and explainability of models are
important if the domain is to move applications into clinical use. More could
be done to share code enabling validation of methods on different institutional
data and to reduce heterogeneity in reporting of study properties allowing
inter-study comparisons. Our results have significance for researchers by
providing a systematic synthesis of existing work to build on, identifying gaps
and opportunities for collaboration, and helping to avoid duplication.
The reporting quality of natural language processing studies - systematic review of studies of radiology reports
Background: Automated language analysis of radiology reports using natural language processing (NLP) can provide valuable information on patients’ health and disease. With its rapid development, NLP studies should have transparent methodology to allow comparison of approaches and reproducibility. This systematic review aims to summarise the characteristics and reporting quality of studies applying NLP to radiology reports.
Methods: We searched Google Scholar for studies published in English that applied NLP to radiology reports of any imaging modality between January 2015 and October 2019. At least two reviewers independently performed screening and completed data extraction. We specified 15 criteria relating to data source, datasets, ground truth, outcomes, and reproducibility for quality assessment. The primary NLP performance measures were precision, recall and F1 score.
Results: Of the 4,836 records retrieved, we included 164 studies that used NLP on radiology reports. The commonest clinical applications of NLP were disease information or classification (28%) and diagnostic surveillance (27.4%). Most studies used English radiology reports (86%). Reports from mixed imaging modalities were used in 28% of the studies. Oncology (24%) was the most frequent disease area. Most studies had dataset size > 200 (85.4%) but the proportion of studies that described their annotated, training, validation, and test set were 67.1%, 63.4%, 45.7%, and 67.7% respectively. About half of the studies reported precision (48.8%) and recall (53.7%). Few studies reported external validation performed (10.8%), data availability (8.5%) and code availability (9.1%). There was no pattern of performance associated with the overall reporting quality.
Conclusions: There is a range of potential clinical applications for NLP of radiology reports in health services and research. However, we found suboptimal reporting quality that precludes comparison, reproducibility, and replication. Our results support the need for development of reporting standards specific to clinical NLP studies.
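For readers less familiar with the performance measures named in this abstract, the standard definitions of precision, recall and F1 can be sketched as follows. The function name and the example counts are hypothetical and are not taken from any of the reviewed studies.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 combines both measures, a study that reports only precision or only recall cannot be compared on F1 with one that reports both.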
Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports
Background: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications.
Methods: We tested the F1 score, precision, and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques established during the development of EdIE-R, in conjunction with an expert researcher who read underlying images.
Results: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR was ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For F1 scores for SVD, EdIE-R scored ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest with ≥77% and Sem-EHR ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%.
Conclusions: The four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, although more apparent for ischaemic stroke. If NLP tools are to be used in clinical settings, this cannot be performed “out of the box.” It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.